Journal of Bioinformatics and Systems Biology — Latest Matching Preprints

1

MenDEL: automated search of BAC sets covering long DNA regions of interest

German, S.; Pinglay, S.; Camellato, B.; Fenyo, D.; Boeke, J. D.

2022-06-29 bioinformatics 10.1101/2022.06.26.496179 medRxiv

Top 0.1%

5.5%

Motivation: Synthetic genomics seeks to synthesize large regions of genomes from the ground up. Such large-scale projects, especially in complex genomes, can rely on pre-existing BAC (Bacterial Artificial Chromosome) libraries as starting material to reduce cost. However, choosing BACs that cover long DNA regions, especially regions that require many BACs, is a manual, idiosyncratic, time-consuming, and error-prone process. Automating this work would make the assembly of large DNA constructs more efficient. Results: We have developed MenDEL, a web-based DNA design application that provides efficient tools for finding BACs that cover long regions of interest and allows results to be sorted by multiple user-defined criteria (total length, number of BACs, longest BAC, etc.). In addition, it enables the user to find a combination of BACs from pre-existing libraries that covers a region of interest not found in any single BAC. Availability and Implementation: The MenDEL application is available to registered users at https://mendel-isg.nyumc.org. The Java code used in the application to find BAC sets is available at https://github.com/MendelProject/BACFinder
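
The covering problem described in this abstract can be illustrated with the classic greedy interval-cover strategy. The sketch below is not the BACFinder code (which is in Java and whose algorithm is not described here); the function name `greedy_bac_cover` and the `(name, start, end)` tuple layout are assumptions made only for this example. It returns a smallest set of BACs whose union covers the region, or `None` if a gap makes full coverage impossible.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout: (BAC name, start, end), with end exclusive.
Bac = Tuple[str, int, int]


def greedy_bac_cover(bacs: List[Bac], region_start: int, region_end: int) -> Optional[List[Bac]]:
    """Pick a minimum-size set of BACs covering [region_start, region_end).

    Classic greedy interval cover: among BACs starting at or before the
    current position, always take the one reaching furthest to the right.
    """
    ordered = sorted(bacs, key=lambda b: b[1])  # sort by start coordinate
    chosen: List[Bac] = []
    pos = region_start
    i = 0
    while pos < region_end:
        best = None
        # Scan every BAC that starts at or before the uncovered position.
        while i < len(ordered) and ordered[i][1] <= pos:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= pos:
            return None  # gap: no BAC extends coverage past `pos`
        chosen.append(best)
        pos = best[2]
    return chosen


library = [("A", 0, 100), ("B", 50, 180), ("C", 90, 150), ("D", 170, 300)]
cover = greedy_bac_cover(library, 10, 250)
print([name for name, _, _ in cover])
```

Sorting results by other criteria mentioned in the abstract (total length, longest BAC) would need an enumeration of alternative covers rather than this single greedy answer.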

2

VCFCons: a versatile VCF-based consensus sequence generator for small genomes

Tseng, E.; Zeng, Q.; Iyer, L.

2021-02-27 bioinformatics 10.1101/2021.02.26.433111 medRxiv

Top 0.1%

4.9%

We developed VCFCons to address the urgent need for a robust consensus sequence generator for SARS-CoV-2 viral surveillance, which presented several unique requirements, including: (a) low-coverage areas should be marked with Ns, and (b) low-frequency or suspicious variant calls need to be filtered. We found that, while some existing tools such as bcftools can generate the desired consensus sequence, doing so requires multiple filtering steps and additional scripting. VCFCons generates consensus sequences from variant calls in VCF format, with versatile filtering criteria based on coverage and estimated variant frequency. We applied VCFCons to the Labcorp SARS-CoV-2 sequencing data and showed that it generated correct consensus sequences that were successfully submitted to GISAID and NCBI. We hope the community will find value in this tool, and we aim to continue developing VCFCons to handle more complex viral data in the future.
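
The two requirements listed above (mask low-coverage positions with N, drop low-frequency calls) can be sketched in a few lines. This is an illustration only, not VCFCons itself: the function `build_consensus`, its parameters, and the pre-parsed `(pos, ref, alt, freq)` variant tuples are assumptions, and only single-nucleotide substitutions are handled.

```python
from typing import List, Sequence, Tuple

# Hypothetical pre-parsed variant: (1-based position, ref base, alt base, allele frequency)
Variant = Tuple[int, str, str, float]


def build_consensus(reference: str, depth: Sequence[int], variants: List[Variant],
                    min_depth: int = 10, min_freq: float = 0.5) -> str:
    """Apply well-supported SNVs to the reference, then mask low-coverage sites."""
    seq = list(reference)
    for pos, ref, alt, freq in variants:
        is_snv = len(ref) == 1 and len(alt) == 1
        # Keep only SNVs whose ref base matches and whose frequency passes the filter.
        if is_snv and freq >= min_freq and reference[pos - 1] == ref:
            seq[pos - 1] = alt
    for i, d in enumerate(depth):
        if d < min_depth:
            seq[i] = "N"  # low coverage: report as unknown, even if a variant was called
    return "".join(seq)


ref = "ACGTACGT"
cov = [20, 20, 5, 20, 20, 20, 20, 3]
calls = [(2, "C", "T", 0.9), (5, "A", "G", 0.2), (3, "G", "A", 0.95)]
print(build_consensus(ref, cov, calls))
```

Masking after variant application is a deliberate choice in this sketch: a confident-looking call at a poorly covered site still ends up as N.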

3

The Tiling Algorithm - A general method for structural characterization of accurate long DNA sequence reads: application to AAV genome sequences.

Bruccoleri, R. E.; Rouleau, D.; Slater, C.; Lata, D.; Phillion, C.; Adjei, S.; Adhikari, K.; Dollive, S.

2025-07-31 bioinformatics 10.1101/2025.07.25.666743 medRxiv

Top 0.1%

4.5%

Adeno-associated virus (AAV), a common vector used in human gene therapy, is a challenging organism for DNA sequencing because its replication cycle results in structural rearrangements. In addition, the AAV manufacturing process can produce small fractions of viral particles containing host cell DNA and/or fragments of the helper plasmids. Pacific Biosciences (PacBio) long-read sequencers are capable of full length, single molecule sequencing of AAV viral genomes, but the analysis of the data is challenging. We present a simple algorithm for determining the arrangement of functional elements of single DNA molecules which can be aggregated to provide a sensitive measure of the population of sequences in a sample, including minor species. Using data from four publicly available datasets, we demonstrate our algorithm is able to characterize nearly all of the species in the AAV samples.
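
The abstract does not spell out the algorithm, so the following is only a sketch of the general idea it describes: reduce each read to the ordered arrangement of functional elements found on it, then aggregate arrangements across reads so that minor species show up as low-count patterns. The hit tuple layout, the element labels, and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical alignment hit on a read: (start, end, element name, strand)
Hit = Tuple[int, int, str, str]


def tile_read(hits: List[Hit]) -> str:
    """Describe one molecule as its elements in read order, e.g. 'ITR+ payload+ ITR-'."""
    ordered = sorted(hits, key=lambda h: h[0])
    return " ".join(f"{element}{strand}" for _, _, element, strand in ordered)


def summarize(reads: Dict[str, List[Hit]]) -> Counter:
    """Count how many reads share each arrangement."""
    return Counter(tile_read(hits) for hits in reads.values())


sample = {
    "read1": [(0, 145, "ITR", "+"), (150, 2000, "payload", "+"), (2005, 2150, "ITR", "-")],
    "read2": [(2005, 2150, "ITR", "-"), (0, 145, "ITR", "+"), (150, 2000, "payload", "+")],
    "read3": [(0, 145, "ITR", "+"), (150, 900, "payload", "+")],  # truncated molecule
}
for pattern, count in summarize(sample).most_common():
    print(count, pattern)
```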

4

DNA Storage Designer: A practical and holistic design platform for storing digital information in DNA sequence

Jiang, L.; Zou, Z.; Ruan, X.; Zhang, X.; Yu, X.; Lan, Y.; Liu, X.

2023-07-12 bioinformatics 10.1101/2023.07.11.548641 medRxiv

Top 0.1%

3.6%

DNA molecules, as natural information carriers, have several benefits over conventional digital storage media, including high information density and long-term durability, and are expected to be a promising candidate for information storage. However, despite significant research in this field, progress has been slow because a complete encoding-decoding platform and simulation-evaluation system has been lacking. In addition, mutations introduced into DNA sequences during synthesis and sequencing mean that multiple experiments are required, and wet-lab experiments can be costly. An in silico simulation platform is therefore urgently needed to promote research. We propose DNA Storage Designer, the first online platform to simulate the whole process of DNA storage experiments. The platform offers classical and novel technologies and experimental settings that simulate three key processes of a DNA storage system: encoding, error simulation, and decoding. First, eight mainstream encoding methods are embedded in the encoding process to convert files to DNA sequences. Second, to uncover potential mutations and sequence distribution changes in actual experiments, we integrate simulation settings for five typical experimental sub-processes (synthesis, decay, PCR, sampling, and sequencing) in the error simulation stage. Finally, the corresponding decoding process converts the DNA sequence back to a binary sequence. Each of these simulation processes produces an analysis report that guides researchers toward better experiment design. In short, DNA Storage Designer is an easy-to-use, automatic web server for simulating DNA storage experiments, which could advance DNA storage-related research. It is freely available to all users at: https://dmci.xmu.edu.cn/dna/.
Author summary: DNA storage technology is an emerging and promising storage technology. At the same time, DNA storage is an interdisciplinary technology that requires researchers to have knowledge of both computer cryptography and biological experiments. However, DNA storage experiments are costly and lengthy, and many studies have been held back by the lack of a comprehensive design and evaluation platform to guide them. Herein, we introduce DNA Storage Designer, the first integrated and practical web server that simulates the whole process of a DNA storage application, from encoding, through error simulation during preservation, to decoding. In the encoding process, we not only provide the coding DNA sequences but also analyze sequence stability. In the error simulation process, we simulate as many experimental situations as possible, such as different mutation probabilities of DNA sequences caused by storage in different bacterial hosts or by different sequencing platforms. The platform offers a high degree of freedom: users can encode their files and run the entire workflow, or upload FASTA files and simulate only the preservation of the sequences, reproducing the mutation errors together with the distribution changes of the sequences.
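
The three stages named above (encoding, error simulation, decoding) can be shown end to end with the simplest possible scheme: two bits per nucleotide and uniform random substitutions. This is only a sketch under those assumptions; it is not one of the eight encoding methods embedded in the platform, and the function names and the fixed `A, C, G, T` bit mapping are chosen for illustration.

```python
import random

BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def encode(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bit pair first."""
    return "".join(BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0))


def decode(dna: str) -> bytes:
    """Invert encode(); expects a length that is a multiple of four."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def mutate(dna: str, rate: float, seed: int = 0) -> str:
    """Substitute each base with probability `rate` (a crude stand-in for
    synthesis and sequencing errors; no insertions or deletions)."""
    rng = random.Random(seed)
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < rate else base
        for base in dna
    )


strand = encode(b"DNA")
noisy = mutate(strand, rate=0.05)
print(strand, decode(strand), sum(a != b for a, b in zip(strand, noisy)))
```

Practical encodings add constraints this sketch ignores, such as avoiding long homopolymer runs and adding redundancy so that decoding survives the simulated errors.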

5

Histopathological characteristics of different parts of surgical specimens of UPJ stenosis

Tuerdi, N.; Bai, J.; Liu, K.; Liu, S.; Xiong, H.; Gao, Y.; Li, J.

2022-09-16 developmental biology 10.1101/2022.09.13.507884 medRxiv

Top 0.1%

3.5%

Objective: To observe the pathological changes at different anatomical sites in specimens obtained from ureteropelvic junction obstruction (UPJO) caused by UPJ stenosis. Materials and Methods: A total of 34 cases of UPJO were studied. With the lesion of the ureteropelvic junction taken as the center, a 1.5 cm renal pelvis segment was taken along the upper edge of the lesion segment (as the control group). Along the lower edge of the ureteral stricture, 1 cm of ureteral stricture tissue was taken downward. Different staining methods were used to observe the tissue arrangement of the different parts, the ratio of muscle to fibrous tissue, and the distribution of interstitial cells of Cajal (ICCs). Results: The number of ICCs in the lesion segment was reduced or even absent, the muscle tissue was arranged in a disordered manner, and the fibrous tissue had proliferated to varying degrees. The pathological changes were statistically different from those of the normal segment and the ureteral stricture segment. Conclusion: The decrease in ICCs and the degree of tissue fibrosis are closely related to the pathogenesis and progression of UPJO caused by UPJ stenosis.

6

The comparison of single-cell RNA sequencing platforms based droplets

Zhong, Y.; Wang, L.

2024-06-17 cell biology 10.1101/2024.06.16.599202 medRxiv

Top 0.1%

3.4%

Single-cell sequencing reveals cellular heterogeneity and enables the discovery of new cellular subpopulations. The main single-cell sequencing strategies are based on combinatorial indexing, microwells, and microfluidics. Because of their simplicity, droplet-based methods are widely used for single-cell multi-omics sequencing. To help researchers choose a platform suited to their application, we compared several commercial platforms: the Chromium X platform of 10x Genomics, the MobiNova-100 platform of MobiDrop, the SeekOne platform of SeekGene, and the C4 platform of BGI. Based on a comprehensive assessment of the data analysis, the Chromium X platform shows excellent performance, closely followed by the MobiNova-100 platform. One-Sentence Summary: As droplet-based single-cell sequencing platforms, Chromium X and MobiNova-100 have comparable data performance.

7

Two Color Single Molecule Sequencing on GenoCare 1600 Platform to Facilitate Clinical Applications

Chen, F.; Liu, B.; Chen, M.; Jiang, Z.; Zhou, Z.; Wu, P.; Zhang, M.; Jin, H.; Li, L.; Lu, L.; Wang, Q.; Shang, H.; Xie, B.; Liu, L.; Lin, X.; Chen, W.; Xu, J.; Sun, R.; Wang, G.; Zheng, J.; Qi, J.; Yang, B.; Chen, D.; Zeng, L.; Li, G.; Li, Y.; Lv, H.; Zhao, N.; Zhou, B.; Wang, W.; Cai, J.; Liu, S.; Luo, W.; Zhang, J.; Zhang, Y.; Lu, Y.; Fan, J.; Dan, H.; He, X.; Liu, L.; Feng, Y.; Chen, J.; Huang, W.; Sun, L.; Yan, Q.

2020-09-29 genetic and genomic medicine 10.1101/2020.09.28.20203455 medRxiv

Top 0.1%

3.4%

With the rapid development of the precision medicine industry, DNA sequencing is becoming increasingly important as a research and diagnostic tool. For clinical applications, medical professionals require a platform that is fast, easy to use, and presents clear information relevant to a definitive diagnosis. We have developed a single-molecule desktop sequencing platform, GenoCare 1600. Fast library preparation (without amplification) and simple instrument operation make it friendlier for clinical use. Here we present sequencing data from an E. coli sample on GenoCare 1600, with consensus accuracy reaching 99.99%. We also demonstrate sequencing of microbial mixtures and of COVID-19 samples from throat swabs. Our data show accurate quantitation of microbial mixtures, sensitive identification of the SARS-CoV-2 virus, and detection of variants confirmed by Sanger sequencing.

8

RealSeq2: a software integrated with UMI identification, error correction, and methylation modifications storing

Wang, K.; Song, M.; Li, M.; Cui, T.; Liu, Z.; Yu, E.; Fang, H.; Gao, X.; Xia, X.; Wang, J.; Guan, Y.; Liu, T.; Yi, X.

2023-05-18 bioinformatics 10.1101/2023.05.16.539668 medRxiv

Top 0.1%

3.3%

High-throughput sequencing with UMI technology is widely used in early tumor screening, detection, recurrence monitoring, etc. Detecting extremely low-frequency mutations is especially important for monitoring tumor recurrence, so high-precision, high-quality data are required. We developed RealSeq2, a new integrated data-preprocessing software based on fastp and gencore, which performs adapter removal, quality control, and UMI identification, and generates consensus reads by clustering and error correction, using multithreading for high-throughput next-generation sequencing data. RealSeq2 also supports methylation data from bisulfite-free 5-methylcytosine sequencing. RealSeq2 defines a new tag in SAM for storing methylation information, which helps downstream analysis co-identify methylation sites and mutation sites. RealSeq2 includes three submodules: ReadsProfiler, ReadsCleaner, and ReadsRecycler. In addition, the output file format (BAM or SAM) is universal for downstream analyses. RealSeq2 is the preferred upstream analysis software for the co-detection of ultra-low frequency mutations and bisulfite-free methylation data. The error profile provides data support for downstream analysis. Additionally, XM tags will become a standard protocol for recording methylation signals.
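
The idea of generating consensus reads from UMI clusters can be sketched as a per-position majority vote within each UMI group. This is a simplified illustration, not the RealSeq2, fastp, or gencore implementation: it assumes exact UMI matching, ungapped reads from the same start position, and ignores base qualities. The function name and the `(umi, sequence)` input layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def umi_consensus(reads: Iterable[Tuple[str, str]], min_reads: int = 2) -> Dict[str, str]:
    """Collapse reads sharing a UMI into one majority-vote consensus sequence."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        if len(seqs) < min_reads:
            continue  # a single read cannot correct its own errors
        length = min(len(s) for s in seqs)  # vote only where every read has a base
        consensus[umi] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
        )
    return consensus


tagged = [("AAT", "ACGT"), ("AAT", "ACGT"), ("AAT", "ACTT"), ("GGC", "TTTT")]
print(umi_consensus(tagged))
```

The third `AAT` read carries a single mismatch that the vote removes, which is how errors introduced after UMI tagging are separated from true low-frequency variants.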

9

TGAC Browser: an open-source genome browser for non-model organisms

Thanki, A. S.; Bian, X.; Davey, R. P.

2019-06-24 bioinformatics 10.1101/677658 medRxiv

Top 0.1%

3.3%

Genome browsers play a vital role in visualising genomic data. Different research groups often require bespoke genome browser customisations, with an obvious need to update, upgrade, and tailor tracks and features on a potentially frequent basis. However, most current genome browsers require highly curated data held in public repositories. In addition, these genome browsers often rely on particular dependencies, so writing plug-ins or modifying existing code can be troublesome and resource-intensive. We present TGAC Browser, a new open-source web-based genome browser designed to overcome shortcomings in available approaches. It uses a locally installed Ensembl Core Database schema and can also visualise data from well-known NGS data formats. We also added simple analysis functionality to perform BLAST searches within TGAC Browser. TGAC Browser also allows users to upload their own genomic data. TGAC Browser is an open-source, easy-to-set-up, and user-friendly genome browser with minimal, lightweight configuration.

10

s-aligner: a greedy algorithm for non-greedy de novo genome assembly

Bermudez, J.

2021-02-02 bioinformatics 10.1101/2021.02.02.429443 medRxiv

Top 0.1%

3.2%

Genome assembly is a fundamental tool for biological research. In microbiology in particular, where budgets per sample are often limited, it can make the difference between an inconclusive result and a fully valid conclusion. Identifying new strains or estimating the relative abundance of quasi-species in a sample are examples of tasks that cannot be properly accomplished without first generating assemblies that have little structural ambiguity and cover most of the genome. In this work, we present a new genome assembly tool based on a greedy strategy. We compare the results obtained with this tool to those obtained with existing software. We find that, when applied to viral studies, the software we developed often produces far larger contigs and higher genome fraction coverage than previous software. We also find a significant advantage when it is applied to exceptionally large virus genomes.

11

JBrowse Jupyter: A Python interface to JBrowse 2

Martinez, T. D. J.; Hershberg, E.; Guo, E.; Stevens, G. J.; Diesh, C.; Xie, P.; Bridge, C.; Cain, S.; Haw, R.; Buels, R. M.; Stein, L. D.; Holmes, I. H.

2022-05-16 bioinformatics 10.1101/2022.05.11.491552 medRxiv

Top 0.1%

3.2%

Motivation: JBrowse Jupyter is a package that aims to close the gap between Python programming and genomic visualization. Web-based genome browsers are routinely used for publishing and inspecting genome annotations. Historically they have been deployed at the end of bioinformatics pipelines, typically decoupled from the analysis itself. However, emerging technologies such as Jupyter notebooks enable a more rapid iterative cycle of development, analysis and visualization. Results: We have developed a package that provides a Python interface to JBrowse 2's suite of embeddable components, including the primary Linear Genome View. The package enables users to quickly set up, launch and customize JBrowse views from Jupyter notebooks. In addition, users can share their data via Google's Colab notebooks, providing reproducible interactive views. Availability: JBrowse Jupyter is released under the Apache License and is available for download on PyPI. Source code and demos are available on GitHub at https://github.com/GMOD/jbrowse-jupyter. Contact: ihh@berkeley.edu

12

ColabPCR: A validated Google Colaboratory Notebook for Reproducible and Precise Primer Design.

Lozano, M. J.; Dallachiesa, D.; Beihammer, G.; Schwestka, J.; Koenig-Beihammer, J.; Pena, E. J.

2025-12-26 bioinformatics 10.64898/2025.12.23.696274 medRxiv

Top 0.1%

3.2%

The Polymerase Chain Reaction (PCR) method often has a lower success rate when amplifying specific genomic regions in eukaryotic genomes. This is frequently due to the non-specific annealing of primers at various genomic locations. To address this issue, we created ColabPCR, a program specifically designed to optimize the selection and evaluation of primers for a defined genomic region. ColabPCR refines primer length and melting temperature parameters using the Primer3 software, and it quantifies the number of potential off-target binding regions within the target genome via BLASTn analysis. Furthermore, the program facilitates the addition of restriction enzyme recognition sites at the 5′ termini of primers and provides a mechanism to confirm the absence of such sites within the intended amplification region. ColabPCR centralizes all these functionalities within a single interface, using Google Colab's computational resources to ensure high performance and accessibility without requiring local software installation. We rigorously validated ColabPCR by designing primers for promoter and terminator regions within the Daucus carota (DCARv2, DH1v3) reference genome. Our findings demonstrate significantly enhanced success rates, particularly when primers exhibiting off-target binding are excluded from the primer design process. In summary, ColabPCR offers a user-friendly and powerful solution that simplifies and enhances the primer design and evaluation workflow, leading to increased accuracy and success in molecular biology experiments.
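
The restriction-site handling described above comes down to two string operations: confirm the recognition site does not occur in the region to be amplified (on either strand), and prepend it to the 5′ end of the primer. The sketch below illustrates that logic only; it is not ColabPCR code, the function names are hypothetical, and the two-base `GC` clamp placed 5′ of the site is an assumed default, not a value taken from the abstract.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def site_absent(amplicon: str, site: str) -> bool:
    """True if the recognition site occurs on neither strand of the amplicon."""
    amplicon, site = amplicon.upper(), site.upper()
    return site not in amplicon and revcomp(site) not in amplicon


def add_site(primer: str, site: str, clamp: str = "GC") -> str:
    """Prepend clamp + recognition site to the 5' end of a primer."""
    return clamp + site + primer


ECORI = "GAATTC"  # EcoRI recognition sequence
target = "ATGCGTACGTTAGCCTAGGCTA"
if site_absent(target, ECORI):
    print(add_site(target[:12], ECORI))
```

Checking the reverse complement matters for non-palindromic sites such as `GGTCTC`, which can be present on the opposite strand without appearing in the forward string.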

13

easyfm : An easy software suite for file manipulation of Next Generation Sequencing data on desktops

Jung, H.; Jeon, B.; Ortiz-Barrientos, D.

2021-09-30 bioinformatics 10.1101/2021.09.29.462291 medRxiv

Top 0.1%

3.1%

Storing and manipulating Next Generation Sequencing (NGS) file formats is an essential but difficult task in biological data analysis. The easyfm (easy file manipulation) toolkit (https://github.com/TaekAndBrendan/easyfm) makes manipulating commonly used NGS files more accessible to biologists. It enables them to perform end-to-end reproducible data analyses using a free standalone desktop application (available on Windows, Mac and Linux). Unlike existing tools (e.g. Galaxy), the Graphical User Interface (GUI)-based easyfm does not depend on any high-performance computing (HPC) system and can be operated without an internet connection. This benefit allows easyfm to seamlessly integrate visual and interactive representations of NGS files, supporting a wider scope of bioinformatics applications in the life sciences.
Author summary: The analysis and manipulation of NGS data for understanding biological phenomena is an increasingly important aspect of the life sciences. Yet most methods for analysing, storing and manipulating NGS data require complex command-line tools on HPC or web-based servers and have not yet been implemented in comprehensive, easy-to-use software. This is a major hurdle preventing more general application in the field of NGS data analysis and file manipulation. Here we present easyfm, a free standalone Graphical User Interface (GUI) software with Python support that helps novice users rapidly discover target sequences (or sequences of interest to the user) in NGS datasets. For user-friendliness and convenience, easyfm was developed with four work modules and a secondary GUI window (herein secondary window), covering different aspects of NGS data analysis (mainly focusing on FASTA files), including post-processing, filtering, format conversion, generating results, real-time log, and help. In combination with the executable tools (BLAST+ and BLAT) and Python, easyfm allows the user to set analysis parameters, select/extract regions of interest, examine the input and output results, and convert to a wide range of file formats. To help augment the functionality of existing web-based and command-line tools, easyfm, a self-contained program, comes with extensive documentation (hosted at https://github.com/TaekAndBrendan/easyfm) including a comprehensive step-by-step guide.

14

Sequencing and assembling complete plasmids on Oxford Nanopore Technology Sequencers using R2C2 and Chopper

Schimke, K. D.; Vollmers, C.

2025-01-21 bioinformatics 10.1101/2025.01.16.633418 medRxiv

Top 0.1%

3.1%

Plasmids are ubiquitous tools in molecular biology which are used for a large variety of experiments within academic and commercial labs. Both new and old plasmids have to undergo sequencing-based analysis to determine whether or not they are functional, i.e. contain the correct insert in the correct backbone. While traditional Sanger sequencing based analysis was most often limited to the inserts, new high-throughput sequencing based methods and services can now provide the complete sequence of a plasmid. Currently available methods and services vary in throughput and cost. Here, we adapted the Oxford Nanopore Technologies-based R2C2 sequencing method to - rapidly and at low cost - sequence complete plasmids, either individually or in a pool. We also developed an analysis pipeline, Chopper, that produces full-length plasmid sequences. We tested our workflow with commonly used plasmids we ordered from Addgene and produced highly accurate sequences for each plasmid from both their individual and pooled sequencing runs.

15

MenDEL: PCR Primer Design as Constrained Optimization Process

German, S.; Mitchell, L.; Vela Gartner, A.; Fenyo, D.; Boeke, J. D.

2022-06-29 bioinformatics 10.1101/2022.06.26.496474 medRxiv

Top 0.1%

2.9%

Motivation: The synthesis of large DNA assemblies has applications in biotechnology and can help us better understand genome biology. These large DNA assemblies are often constructed from many smaller DNA segments, and it is critical to verify that they are correctly assembled. One low-cost and rapid method to ensure that the connection between each pair of segments is correct is to use PCR with primer pairs that span assembly junctions. However, designing PCR primers for large assemblies consisting of multiple segments, and therefore containing multiple assembly junctions, is a challenging process. Rule-based automation of the process often finds primers that satisfy general criteria but are not necessarily the best fit for every particular junction. Results: We have developed MenDEL, a web-based DNA design application that provides a primer pair computation tool for multiple assembly junctions, automatically picking the optimal pair of primers for each junction based on user-specified constraints. Availability and Implementation: The MenDEL application is available at https://mendel-isg.nyumc.org to registered users, and the code base for computing junction primers is available at https://github.com/MendelProject/PrimerOptimization.
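
Treating junction primer design as constrained optimization can be sketched as follows: enumerate candidate primers on each side of a junction, discard those that violate hard constraints, and pick the pair that minimizes an objective. The sketch below is not the MenDEL code. It assumes a melting-temperature window as the only constraint, the Wallace rule (2 °C per A/T, 4 °C per G/C) as a rough Tm estimate, and the Tm difference within the pair as the objective; all names and default values are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]


def wallace_tm(primer):
    """Rough Tm estimate for short oligos: 2*(A+T) + 4*(G+C)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def candidates(seq, lo, hi, lengths):
    """All (start, substring) windows of the given lengths inside seq[lo:hi]."""
    return [(s, seq[s:s + n]) for n in lengths for s in range(lo, hi - n + 1)]


def best_junction_pair(seq, junction, flank=60, lengths=range(18, 23), tm_range=(52, 62)):
    """Return ((fwd_start, fwd), (rev_start, rev)) spanning the junction, or None."""
    lo_tm, hi_tm = tm_range
    # Forward primers lie upstream of the junction, reverse primers downstream.
    fwd = [(s, p) for s, p in candidates(seq, max(0, junction - flank), junction, lengths)
           if lo_tm <= wallace_tm(p) <= hi_tm]
    rev = [(s, revcomp(p)) for s, p in candidates(seq, junction, min(len(seq), junction + flank), lengths)
           if lo_tm <= wallace_tm(p) <= hi_tm]
    if not fwd or not rev:
        return None  # constraints cannot be satisfied for this junction
    # Objective: smallest Tm difference between the two primers of the pair.
    return min(product(fwd, rev), key=lambda pair: abs(wallace_tm(pair[0][1]) - wallace_tm(pair[1][1])))


assembly = "ACGT" * 40  # toy 160 bp construct with a junction at position 80
print(best_junction_pair(assembly, 80))
```

Running the search independently per junction is what distinguishes this formulation from a single rule set applied uniformly: each junction gets the best pair available in its own sequence context.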

16

PathoResist AI: A One-Click Web Platform for Rapid Pathogen Resistance Analysis Based on the all_ratio Algorithm

Mai, G.; Dai, Y.

2026-02-13 bioinformatics 10.64898/2026.02.12.705264 medRxiv

Top 0.1%

2.8%

This study introduces a one-stop analysis platform named "PathoResistAI" (https://www.resistpath.com/), which can be used to solve the technical bottlenecks of pathogenic microorganism detection and antimicrobial resistance analysis. The platform is based on nanopore sequencing and the innovative all-ratio algorithm, which integrates four-dimensional parameters (sequence similarity, abundance, matching number, and matching length), significantly improving the detection accuracy of low-abundance pathogens and drug-resistance genes. The platform adopts four layers of modular design (input layer, core engine, dual-channel output, and visualization layer). Users only need to upload data in FASTQ format, and they can obtain automated reports, including pathogen identification and drug-resistance gene prediction within 30 min. The verification results show that the platform can accurately identify bacteria (e.g., Staphylococcus aureus and Serratia marcescens), viruses (e.g., Ebola virus), and drug-resistance genes (e.g., SdeY), which are consistent with the published literature results. Limitations include only supporting long-read sequencing data, small sample size (fewer than 50 cases), and lack of real clinical sample verification. In general, this platform represents the application and exploration of nanopore sequencing in the field of rapid detection of pathogenic microorganisms, and provides a new tool for microbial pathogen or AMR detection research.

17

Ontology-based modeling, integration, and analysis of heterogeneous clinical, pathological, and molecular kidney data for precision medicine

He, Y.; Barisoni, L.; Rosenberg, A. Z.; Robinson, P.; Diehl, A. D.; Chen, Y.; Phuong, J. P.; Hansen, J.; Herr, B. W.; Borner, K.; Schaub, J.; Bonevich, N.; Arnous, G.; Boddapati, S.; Zheng, J.; Alakwaa, F.; Sarder, P.; Duncan, W. D.; Liang, C.; Valerius, M. T.; Jain, S.; Iyengar, R.; Himmelfarb, J.; Kretzler, M.; Kidney Precision Medicine Project,

2024-04-02 bioinformatics 10.1101/2024.04.01.587658 medRxiv

Top 0.1%

2.8%

Many resources are now generating, processing, storing, or providing kidney-related molecular, pathological, and clinical data. Reference ontologies offer an opportunity to support knowledge and data organization and integration. The Kidney Precision Medicine Project (KPMP) team contributed to the representation and addition of 329 kidney phenotype terms to the Human Phenotype Ontology (HPO) and identified many subcategories of acute kidney injury (AKI) or chronic kidney disease (CKD). The Kidney Tissue Atlas Ontology (KTAO) imports and integrates kidney-related terms from existing ontologies (e.g., HPO, CL, and Uberon) and represents 259 kidney-related biomarkers. We also developed a precision medicine metadata ontology (PMMO) to integrate 50 variables from KPMP and CellxGene resources and applied PMMO for integrative analysis. The gene expression profiles of kidney gene biomarkers were specifically analyzed in healthy controls or AKI/CKD disease states. This work demonstrates how ontology-based approaches support multi-domain data and knowledge organization and integration to advance precision medicine.

18

A Series of Composited Tumor DNA Reference Materials Containing Three Genes and Ten Mutation Positions for CNV and SNV Detection

FAN, W.; SHI, Y.; Zhang, H.; Li, C.; Zhang, J.; Su, S.; Wu, P.; Tang, M.

2023-05-05 molecular biology 10.1101/2023.05.04.538185 medRxiv

Top 0.1%

2.8%

Clinical processes for tumor diagnosis and treatment need reference materials (RMs) for evaluation and calibration. However, no RM simultaneously provides copy number variation (CNV) and single nucleotide variant (SNV) properties for the genes EGFR, HER2, MET, PIK3CA, KRAS, BRAF, and NRAS. In this study, we used commercial cell lines to construct a series of tumor RMs with the properties mentioned above. Furthermore, we evaluated their stability, homogeneity, and commutability by droplet digital PCR and next-generation sequencing. The results showed that, for the tumor CNV gDNA RM, the copy number is 7.3 copies/L (EGFR), 5.3 copies/L (HER2) and 8.2 copies/L (MET). For the tumor 5% SNV gDNA RM, the mutation frequencies at each mutation position were as follows: EGFR-E746A750 (24.6%), EGFR-L858R (5.8%), EGFR-T790M (5.5%), EGFR-G719S (6.6%), PIK3CA-E545K (4.7%), PIK3CA-H1047R (5.8%), KRAS-G13D (8.2%), KRAS-G12D (6.5%), BRAF-V600E (4.6%), NRAS-Q61K (8.5%). All coefficients of variation (CV) for homogeneity of the tumor gDNA RMs were less than 7%, and those of the CNV+SNV ctDNA RM were less than 17%. In addition, the CVs for commutability of all types of RMs were less than 17%. These RMs can be applied to a wide range of sequencing panels and provide a closer simple background.

19

cycle_finder: de novo analysis of tandem and interspersed repeats based on cycle-finding

Tanaka, Y.; Kajitani, R.; Itoh, T.

2023-07-19 bioinformatics 10.1101/2023.07.17.549334 medRxiv

Top 0.1%

2.8%

Repeat sequences in the genome can be classified into interspersed and tandem repeats, both of which are important for understanding genome evolution and important traits such as disease. They are also noteworthy as regions with a high frequency of genome rearrangement in somatic cells and high inter-individual diversity. Existing repeat detection tools are limited in that they target only one of the two types and/or require reference sequences. In this study, we developed a novel tool, cycle_finder, which constructs a graph structure (de Bruijn graph) from low-cost short-read data and constructs units of both types of repeats. The tool can detect cycles with branching and the corresponding tandem repeats, and can also construct interspersed repeats by exploring non-cycle subgraphs. Furthermore, it can estimate sequences with large copy-number differences by using two samples as input. Benchmarking with simulations and actual data from the human genome showed that this tool had superior recall and precision compared to existing methods. In a test on roundworm data, in which large-scale deletions occur in somatic cells, the tool succeeded in detecting deletion sequences reported in previous studies. This tool is expected to enable low-cost analysis of repeat sequences that were previously difficult to construct.
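
The link between tandem repeats and cycles in a de Bruijn graph can be shown on a toy read. The sketch below is not the cycle_finder implementation: it builds an unweighted graph without k-mer counts, handles neither sequencing errors nor branching cycles, and returns the first cycle found by a depth-first search. All function names are chosen for this example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def find_cycle(graph):
    """Return the node list of one simple cycle (iterative DFS), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = defaultdict(int)
    for root in sorted(graph):
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        path = [root]
        stack = [(root, iter(sorted(graph.get(root, ()))))]
        while stack:
            node, successors = stack[-1]
            nxt = next(successors, None)
            if nxt is None:
                color[node] = BLACK
                stack.pop()
                path.pop()
            elif color[nxt] == GRAY:
                return path[path.index(nxt):]  # back edge closes a cycle
            elif color[nxt] == WHITE:
                color[nxt] = GRAY
                path.append(nxt)
                stack.append((nxt, iter(sorted(graph.get(nxt, ())))))
    return None


def cycle_to_unit(cycle):
    """Spell one repeat unit: each node on the cycle contributes its last base."""
    return "".join(node[-1] for node in cycle)


graph = de_bruijn(["ACGACGACGACG"], k=4)
unit = cycle_to_unit(find_cycle(graph))
print(unit)
```

The reported unit is a rotation of the repeat (`ACG` here), since a cycle has no natural starting node.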

20

COVATOR: A Software for Chimeric Coronavirus Identification

Habib, P.

2020-11-16 bioinformatics 10.1101/2020.11.14.383075 medRxiv

Top 0.1%

2.7%

The term "chimeric virus" was not widely used in recent decades. Current sequencing efforts to understand COVID-19 suggest six main coronavirus strains, but the virus diverges into hundreds of sub-strains. The bottleneck is the mutation rate: with two mutations per month, humanity will meet a new sub-strain every month. Tracking newly sequenced viruses is urgently needed because of the pathogenic effect of the new sub-strains. Here we introduce COVATOR, a user-friendly, Python-based software that identifies viral chimerism. COVATOR aligns an input genome or protein of unknown source against genomes and proteins of known source, then gives the user a graphical summary.